Facilitating the Reproducibility of Scientific Workflows with Execution Environment Specifications
نویسندگان
چکیده
Scientific workflows are designed to solve complex scientific problems and accelerate scientific progress. Ideally, scientific workflows should improve the reproducibility of scientific applications by making it easier to share and reuse workflows between scientists. However, scientists often find it difficult to reuse others’ workflows, which is known as workflow decay. In this paper, we explore the challenges in reproducing scientific workflows, and propose a framework for facilitating the reproducibility of scientific workflows at the task level by giving scientists complete control over the execution environments of the tasks in their workflows and integrating execution environment specifications into scientific workflow systems. Our framework allows dependencies to be archived in basic units of OS image, software and data instead of gigantic all-in-one images. We implement a prototype of our framework by integrating Umbrella, an execution environment creator, into Makeflow, a scientific workflow system. To evaluate our framework, we use it to run two bioinformatics scientific workflows, BLAST and BWA. The execution environment of the tasks in each workflow is specified as an Umbrella specification file, and sent to execution nodes where Umbrella is used to create the specified environment for running the tasks. For each workflow we evaluate the size of the Umbrella specification file, the time and space overheads of creating execution environments using Umbrella, and the heterogeneity of execution nodes contributing to each workflow. The evaluation results show that our framework improves the utilization of heterogeneous computing resources, and improves the portability and reproducibility of scientific workflows.
منابع مشابه
A Clustering Approach to Scientific Workflow Scheduling on the Cloud with Deadline and Cost Constraints
One of the main features of High Throughput Computing systems is the availability of high power processing resources. Cloud Computing systems can offer these features through concepts like Pay-Per-Use and Quality of Service (QoS) over the Internet. Many applications in Cloud computing are represented by workflows. Quality of Service is one of the most important challenges in the context of sche...
متن کاملReproducibility of execution environments in computational science using Semantics and Clouds
In the past decades, one of the most common forms of addressing reproducibility in scientific workflow-based computational science has consisted of tracking the provenance of the produced and published results. Such provenance allows inspecting intermediate and final results, improves understanding, and permits replaying a workflow execution. Nevertheless, this approach does not provide any mea...
متن کاملUsing Cloud-Aware Provenance to Reproduce Scientific Workflow Execution on Cloud
Provenance has been thought of a mechanism to verify a workflow and to provide workflow reproducibility. This provenance of scientific workflows has been effectively carried out in Grid based scientific workflow systems. However, recent adoption of Cloud-based scientific workflows present an opportunity to investigate the suitability of existing approaches or propose new approaches to collect p...
متن کاملApproaches to Distributed Execution of Scientific Workflows in Kepler
The Kepler scientific workflow system enables creation, execution and sharing of workflows across a broad range of scientific and engineering disciplines while also facilitating remote and distributed execution of workflows. In this paper, we present and compare different approaches to distributed execution of workflows using the Kepler environment, including a distributed dataparallel framewor...
متن کاملOn the Use of Abstract Workflows to Capture Scientific Process Provenance
Capturing provenance about artifacts produced by distributed scientific processes is a challenging task. For example, one approach to facilitate the execution of a scientific process in distributed environments is to break down the process into components and to create workflow specifications to orchestrate the execution of these components. However, capturing provenance in such an environment,...
متن کامل